{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ldpQI0I1rvsc",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Using MLFlow and Evidently to Evaluate Data Drift"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "S7H9fVO5rvsi",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "In this example, we will explore the MLflow integration with Evidently.\n",
    "\n",
    "This notebook shows how you can use the Evidently and MLflow to:\n",
    "* calculate data drift for the model, performed as batch checks \n",
    "* log data drift using MLflow Tracking\n",
    "* explore the result using MLflow UI\n",
    "\n",
    "Acknowledgments:\n",
    "* The dataset used in the example is from:  https://www.kaggle.com/c/bike-sharing-demand/data?select=train.csv\n",
    "* Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg\n",
    "* More information about the dataset can be found in UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vrV5lg7hrvsj",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Getting Started¶\n",
    "To run this tutorial:\n",
    "\n",
    "1. Install MLflow\n",
    "You can install MLflow with the following command `pip install mlflow` or install MLflow with scikit-learn via `pip install mlflow[extras]`\n",
    "More details:https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html#id5\n",
    "\n",
    "2. Install Evidently\n",
    "You can install Evidently with the following command `pip install evidently`\n",
    "More details: https://docs.evidentlyai.com/install-evidently \n",
    "\n",
    "3. Optionally, you can load data from https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset and save in locally or skip this step and download data with  ```requests```  using instructions below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Qs05Qlgqrvsk",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "warnings.simplefilter('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Z8aCer8Yrvsm",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import pandas as pd\n",
    "import requests\n",
    "import zipfile\n",
    "import io\n",
    "\n",
    "from evidently.pipeline.column_mapping import ColumnMapping\n",
    "from evidently.report import Report\n",
    "from evidently.metric_preset import DataDriftPreset\n",
    "\n",
    "import mlflow\n",
    "import mlflow.sklearn\n",
    "from mlflow.tracking import MlflowClient"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pbXppBj_rvsn",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "More information about the dataset can be found in Kaggle Playground Competition: https://www.kaggle.com/c/bike-sharing-demand/data?select=train.csv\n",
    "\n",
    "Acknowledgement: Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "oNrm-FZirvso",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#load data\n",
    "content = requests.get(\"https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip\").content\n",
    "with zipfile.ZipFile(io.BytesIO(content)) as arc:\n",
    "    raw_data = pd.read_csv(arc.open(\"day.csv\"), header=0, sep=',', parse_dates=['dteday'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "d0rSZmyOrvso",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#observe data structure\n",
    "raw_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "6Xa9u5P7rvsp",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#set column mapping for Evidently Profile\n",
    "data_columns = ColumnMapping()\n",
    "data_columns.datetime = 'dteday'\n",
    "data_columns.numerical_features = ['weathersit', 'temp', 'atemp', 'hum', 'windspeed']\n",
    "data_columns.categorical_features = ['holiday', 'workingday']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qUvh4bd-rvsp",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#evaluate data drift with Evidently Profile\n",
    "def eval_drift(reference, production, column_mapping):\n",
    "    \"\"\"\n",
    "    Returns a list with pairs (feature_name, drift_score)\n",
    "    Drift Score depends on the selected statistical test or distance and the threshold\n",
    "    \"\"\"    \n",
    "    data_drift_report = Report(metrics=[DataDriftPreset()])\n",
    "    data_drift_report.run(reference_data=reference, current_data=production, column_mapping=column_mapping)\n",
    "    report = data_drift_report.as_dict()\n",
    "\n",
    "    drifts = []\n",
    "\n",
    "    for feature in column_mapping.numerical_features + column_mapping.categorical_features:\n",
    "        drifts.append((feature, report[\"metrics\"][1][\"result\"][\"drift_by_columns\"][feature][\"drift_score\"]))\n",
    "\n",
    "    return drifts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "S1CXPhMcrvsq",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#set reference dates\n",
    "reference_dates = ('2011-01-01 00:00:00','2011-01-28 23:00:00')\n",
    "\n",
    "#set experiment batches dates\n",
    "experiment_batches = [\n",
    "    ('2011-01-01 00:00:00','2011-01-29 23:00:00'),\n",
    "    ('2011-01-29 00:00:00','2011-02-07 23:00:00'),\n",
    "    ('2011-02-07 00:00:00','2011-02-14 23:00:00'),\n",
    "    ('2011-02-15 00:00:00','2011-02-21 23:00:00'),  \n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "byVxDBrYrvsr",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#log into MLflow\n",
    "client = MlflowClient()\n",
    "\n",
    "#set experiment\n",
    "mlflow.set_experiment('Data Drift Evaluation with Evidently')\n",
    "\n",
    "#start new run\n",
    "for date in experiment_batches:\n",
    "    with mlflow.start_run() as run: #inside brackets run_name='test'\n",
    "        \n",
    "        # Log parameters\n",
    "        mlflow.log_param(\"begin\", date[0])\n",
    "        mlflow.log_param(\"end\", date[1])\n",
    "\n",
    "        # Log metrics\n",
    "        metrics = eval_drift(raw_data.loc[raw_data.dteday.between(reference_dates[0], reference_dates[1])], \n",
    "                             raw_data.loc[raw_data.dteday.between(date[0], date[1])], \n",
    "                             column_mapping=data_columns)\n",
    "        for feature in metrics:\n",
    "            mlflow.log_metric(feature[0], round(feature[1], 3))\n",
    "\n",
    "        print(run.info)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5MKvdY9Nrvsr",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#run MLflow UI (it will be more convinient to run it directly from the terminal)\n",
    "#!mlflow ui"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Support Evidently\n",
    "Did you find the example useful? Star Evidently on GitHub to contribute back! This helps us continue creating free open-source tools for the community. https://github.com/evidentlyai/evidently"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
